Abstract



This paper studies the problem of learning clusters which are consistently present in different (continuously valued) representations of observed data. Our setup differs slightly from the standard approach of (co-)clustering, as we use the fact that some form of 'labeling' becomes available in this setup: a cluster is only interesting if it has a counterpart in the alternative representation. The contribution of this paper is twofold: (i) the problem setting is explored and an analysis in terms of the PAC-Bayesian theorem is presented, (ii) a practical kernel-based algorithm is derived exploiting the inherent relation to Canonical Correlation Analysis (CCA), as well as its extension to multiple views. A content-based information retrieval (CBIR) case study is presented on the multi-lingual aligned Europarl document dataset which supports the above findings.

1 Introduction 

Consider the setup where individual observations come in two different representations $(x, y)$. This paper focuses on the question: 'If we observe a new $x$, what can be said about the corresponding $y$, and vice versa?' While this abstract problem has obvious relations to classical supervised learning, its inherent symmetry relates it to unsupervised learning as well. This paper studies the above problem, specifying the properties to be predicted in terms of pre-specified membership functions. Figure 1 differentiates the above problem, termed PairWise Cluster Analysis (PWCA), from supervised, unsupervised, semi-supervised, transfer and multiple-task learning [1], and self-taught learning [2]. The present learning strategy has direct relations to co-occurrence analysis, co-clustering [3] and kernel Canonical Correlation Analysis (kCCA) [4], and has been motivated by the previous works of Pelckmans et al. [5] and Sim et al. [6], which explore applications in relating a text corpus to microarray expression data and in multi-attribute co-clustering, respectively.

The analysis given in Section 2 phrases the learning problem in terms of the PAC-Bayesian theorem, much in the spirit of the recent work of Seldin & Tishby [7]. However, while the latter concerns density estimation for discrete variables, the presented ideas cover a spectrum of unsupervised learning (clustering). The analysis presented in [7] concerns, essentially, the same quantity $E_Q[\mathcal{R}(h)]$ as in Subsection 2.1, equation (6), which characterizes how well a hypothesis $Q$ aligns with the distribution underlying the data. Our extension to pairwise clustering is fundamentally different, incorporating a notion of prediction 'loss', while the relation between the Kullback-Leibler (KL) divergence and the norm of a hypothesis establishes a link with the learning algorithm.

Figure 1: Pictorial representation of different learning paradigms, extending the picture in [2]. Suppose the aim is to discriminate elephants from rhinos. When a picture appears in a frame, a corresponding class label is available. The cases are: (a) supervised classification, (b) unsupervised learning, (c) semi-supervised learning, (d) transfer learning (the two different colors indicate two different learning tasks), (e) self-taught learning, and (f) pairwise cluster analysis (PWCA). Note that in the latter we do not try to find the class labels themselves, but to recover the symbiotic relations elephant-egret and rhino-oxpecker. Specifically, the presence of oxpeckers might help us in predicting the presence of a rhino, and vice versa.

Section 3 derives an effective learning algorithm, boiling down to a quadratic (or a generalized) eigenvalue problem. This learning machine is closely related to kernel Canonical Correlation Analysis (see e.g. [8, 4] and references therein). Empirical evidence for this learning paradigm and the proposed algorithm is then presented in Section 4, where we demonstrate the benefit of learning structure within the data on a multi-lingual text corpus [9]. Section 5 indicates a number of open questions.

2 A Generic Analysis using the PAC-Bayes Theorem 

Consider a function $h_r : \{x\} \to [0, 1]$ that verifies, for a given problem setting, how well a certain 'rule' $r$ performs on a sample $x$. The goal of a learning algorithm is to find the best rule $r$ in a given set of plausible rules (the hypothesis set). Learning proceeds by collecting a dataset $\{X_i\}_{i=1}^n$ of $n$ observations assumed to be sampled independently from identical distributions (i.i.d.)¹. The empirical risk $\mathcal{R}_n(h_r)$ and the actual risk $\mathcal{R}(h_r)$ of a 'hypothesis' $h_r \in \mathcal{H}$ are defined as

$$\mathcal{R}_n(h_r) = \frac{1}{n}\sum_{i=1}^n h_r(X_i), \qquad \mathcal{R}(h_r) = E[h_r(X)], \qquad (1)$$

where the expectation $E[\cdot]$ concerns the fixed, unknown distribution underlying the $n$ i.i.d. observations. For supervised learning problems, (informally) an observation $x$ typically consists of a couple $(z, y)$ with a covariate $z$ and an 'output' $y$. Then $h_r$ is often rephrased as $h_r(x) = \ell(y - r(z))$, where $\ell : \mathbb{R} \to [0, 1]$ is the 'prediction loss' between the actual observation $y$ and its prediction $r(z)$. In a Bayesian context, we assume that the hypotheses $h_r \in \mathcal{H}$ are also 'stochastic' elements², possessing some notion of likelihood, say $Q : \mathcal{H} \to [0, 1]$ such that $\int_{\mathcal{H}} Q(h_r)\,dh_r = 1$. Consider first the case where $\mathcal{H}$ is finite. We are interested in the functions $E_Q[h_r(x)]$, defined as

$$E_Q[h_r(x)] = \sum_{h_r \in \mathcal{H}} h_r(x)\,Q(h_r). \qquad (2)$$

If $|\mathcal{H}|$ is infinite, the sum is replaced by an integral as usual, i.e. $E_Q[h_r(x)] = \int_{\mathcal{H}} h_r(x)\,Q(h_r)\,dh_r$. In the analysis we will assume $|\mathcal{H}| < \infty$ in order to avoid technical issues. Note that $E_Q[\cdot]$ is not the regular expectation $E[\cdot]$ used before: it averages over hypotheses rather than over the data. Now let the Kullback-Leibler divergence be defined for each $0 < p, q < 1$ as $\mathrm{KL}(q, p) = q \log\frac{q}{p} + (1 - q)\log\frac{1-q}{1-p}$, where $\log(\cdot)$ denotes the natural logarithm.

¹ We will use the convention of denoting stochastic variables by capital letters, e.g. $X, Y, \ldots$, while deterministic quantities are denoted in lower case, e.g. $h, f, i, x, y, n, \ldots$.

² In a PAC-Bayesian context, we will merely consider weighted sums of the elements in $\mathcal{H}$, rather than assuming a truly Bayesian setup.

Let the function $P : \mathcal{H} \to [0, 1]$ be the prior weighting function over $\mathcal{H}$. If $Q : \mathcal{H} \to [0, 1]$ and $P : \mathcal{H} \to [0, 1]$ are two such functions, we extend the definition as

$$\mathrm{KL}(Q, P) = \sum_{h_r \in \mathcal{H}} Q(h_r)\log\frac{Q(h_r)}{P(h_r)}. \qquad (3)$$
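The quantities in (1), (2) and (3) are straightforward to compute when the hypothesis set is finite. The following is a minimal sketch, assuming a hypothetical hypothesis set of four bin indicators on $[0, 1)$ and made-up samples and weights; none of these values come from the paper.

```python
import math

def empirical_risk(h, samples):
    """R_n(h): average of h over the observed samples, as in (1)."""
    return sum(h(x) for x in samples) / len(samples)

def q_average(values, q):
    """E_Q[.]: Q-weighted sum over a finite hypothesis set, as in (2)."""
    return sum(v * w for v, w in zip(values, q))

def kl_divergence(q, p):
    """KL(Q, P) of (3), natural logarithm; terms with Q(h) = 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def make_bin_indicator(lo, width=0.25):
    """Hypothetical hypothesis: indicator of the interval [lo, lo + width)."""
    return lambda x: 1.0 if lo <= x < lo + width else 0.0

# Illustrative (made-up) hypothesis set, data and weighting functions.
hypotheses = [make_bin_indicator(lo) for lo in (0.0, 0.25, 0.5, 0.75)]
samples = [0.05, 0.10, 0.20, 0.30, 0.60, 0.70, 0.80, 0.90]
prior = [0.25, 0.25, 0.25, 0.25]      # uniform P
posterior = [0.4, 0.2, 0.2, 0.2]      # some Q chosen for illustration

risks = [empirical_risk(h, samples) for h in hypotheses]
weighted_risk = q_average(risks, posterior)
divergence = kl_divergence(posterior, prior)
print(risks, weighted_risk, divergence)
```

Skipping the terms with $Q(h) = 0$ follows the usual convention $0 \log 0 = 0$.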

We state the PAC-Bayes theorem as in [10]:

Theorem 1 For $\delta > 0$ and for $n > 8$, with probability exceeding $1 - \delta$ the following inequality holds for all $Q : \mathcal{H} \to [0, 1]$:

$$\mathrm{KL}\big(E_Q[\mathcal{R}_n(h_r)],\; E_Q[\mathcal{R}(h_r)]\big) \le \ldots \qquad (4)$$

Specifically, this holds for a $Q_n$ found by an algorithm based on the $n$ i.i.d. observations. Note that this result is currently the tightest such inequality, refining the ideas presented in [11]. While to date most applications are found in the context of supervised learning, we will argue in the following that this theorem finds a 'natural' application in unsupervised learning.
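A bound on the binary KL divergence, as in Theorem 1, turns into an explicit upper bound on the actual risk by numerical inversion. The sketch below illustrates this with bisection. It assumes, purely for illustration, a right-hand side of the form $(\mathrm{KL}(Q, P) + \log(2\sqrt{n}/\delta))/n$; this form, the helper names and the numbers used are assumptions of the example, not a restatement of (4).

```python
import math

def binary_kl(q, p):
    """KL(q, p) between two Bernoulli parameters (natural logarithm)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)   # keep the logarithms finite
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total

def kl_upper_inverse(q, bound, tol=1e-10):
    """Largest p >= q with binary_kl(q, p) <= bound, found by bisection."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

def assumed_rhs(kl_qp, n, delta):
    """ASSUMED right-hand side, used here only for illustration."""
    return (kl_qp + math.log(2.0 * math.sqrt(n) / delta)) / n

# Made-up numbers: empirical Q-weighted risk 0.1, KL(Q, P) = 0.05, n = 500.
rhs = assumed_rhs(kl_qp=0.05, n=500, delta=0.05)
risk_upper = kl_upper_inverse(0.1, rhs)
print(rhs, risk_upper)
```

Bisection works because $\mathrm{KL}(q, p)$ is increasing in $p$ for $p \ge q$.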

2.1 An Application of PAC-Bayes Towards Clustering 

In what follows, assume that the $n$ i.i.d. samples $\{X_i\}_{i=1}^n$ take values in a bounded set $S \subset \mathbb{R}^d$ for a given $d \in \mathbb{N}$. In order to apply the PAC-Bayes result to clustering, we need to specify the loss function $\ell : \mathbb{R}^d \to [0, 1]$ of interest. A 'cluster', represented as an indicator function $h : \mathbb{R}^d \to \{0, 1\}$, is understood here as a member of a user-specified set of indicator functions $\mathcal{H} = \{h : \mathbb{R}^d \to \{0, 1\}\}$. Formally, one defines for a set $c \subset \mathbb{R}^d$

$$h_c(x) = I(x \in c) = \begin{cases} 1 & x \in c \\ 0 & x \notin c. \end{cases} \qquad (5)$$

Now, we look a bit closer at what the term $E_Q[\mathcal{R}(h_c)]$ represents in this context:

$$E_Q[\mathcal{R}(h_c)] = \sum_{h_c \in \mathcal{H}} \mathbb{P}(X \in c)\,Q(h_c) = E\Big[\sum_{h_c \in \mathcal{H}} h_c(X)\,Q(h_c)\Big], \qquad (6)$$

where the second equality holds by linearity of the expectation, and where $\mathbb{P}$ denotes the probability rules underlying the data. Consequently, the term $E_Q[\mathcal{R}(h)]$ characterizes how well $Q$ aligns with the distribution underlying the data. Assume that $\mathcal{H}$ is designed such that the sets $c$ corresponding to the $h_c \in \mathcal{H}$ (i) cover the space $S$ and (ii) are disjoint.

The function $P : \mathcal{H} \to [0, 1]$ is the prior weighting function (think of it as a 'prior distribution' over $\mathcal{H}$). In general, it is up to the user in a specific application to decide how to design $(\mathcal{H}, P)$: it is good practice to make it equally likely for each hypothesis $h \in \mathcal{H}$ to explain the data by itself, suggesting a uniform prior $P$ over $\mathcal{H}$, while the result should remain useful for the application in mind. Assume for example that all probability mass (underlying the samples) concentrates in the set corresponding to a single $h_{c_j}$, and that $Q(h_{c_i}) = I(i = j)$; then this measure equals 1. On the other hand, if all samples are equally distributed over the $|\mathcal{H}|$ sets $h_c \in \mathcal{H}$, the measure equals $\frac{1}{|\mathcal{H}|}$. This motivates naming $E_Q[\mathcal{R}(h)]$ the explanatory power of $(\mathcal{H}, Q)$. Specifically, if $\mathcal{H} = \{I(x \in [-1, 1]^d)\}$, the explanatory power of $(\mathcal{H}, Q)$ is 1, but the result is neither useful, surprising nor falsifiable.
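The three cases discussed above can be checked numerically once the probability mass of each cell is known. This is a small sketch with made-up cell probabilities; the function name `explanatory_power` is introduced here for illustration only.

```python
def explanatory_power(cell_probs, q):
    """E_Q[R(h)] = sum over cells of P(X in c) * Q(h_c), as in (6)."""
    return sum(pc * qc for pc, qc in zip(cell_probs, q))

# Case 1: all mass in one cell, and Q selects exactly that cell.
concentrated = explanatory_power([0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])

# Case 2: mass spread equally over |H| = 4 cells; any Q gives 1/|H|.
spread = explanatory_power([0.25] * 4, [0.7, 0.1, 0.1, 0.1])

# Case 3: a single all-covering hypothesis explains everything, trivially.
trivial = explanatory_power([1.0], [1.0])

print(concentrated, spread, trivial)
```

The third case shows why a high explanatory power alone is not informative: a hypothesis set with one all-covering cluster always scores 1.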

We argue that this PAC-Bayesian interpretation of clustering is often 'natural' for three reasons. (i) The present analysis does not need to recover the density function underlying the data, a feature which is highly desirable when working with high-dimensional data. (ii) The set of 'underlying' clusters is not recovered exactly, nor assumed to exist in reality. The actual stochastic rules underlying the observed data only say how well the hypothesized clustering 'explains' the data. When dealing with data arising from complex processes, the assumption of a 'true clustering' is often an oversimplification. (iii) The characterization of the performance of the found rule $Q_n$ in terms of its deviation from the prior $P$ is desirable if clustering is meant to look for 'consistent' irregularities. Specifically, if the result $Q_n$ is not what we (more or less) expected before seeing the data, substantial empirical evidence should be presented to support it. These reasons differentiate the approach substantially from approaches based on density estimation or on mixtures of distributions. Note that this description of explanatory power is strongly related to the ideas presented in [12]. The following clustering algorithm is then motivated by application of the PAC-Bayesian theory:

$$Q_n = \arg\max_{Q} E_Q[\mathcal{R}_n(h)] \quad \text{s.t.} \quad \mathrm{KL}(Q, P) \le \omega_n, \qquad (7)$$

where $\omega_n > 0$. This objective is also motivated from an information-theoretic approach to clustering.
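For a finite hypothesis set, a problem of the form (7) (a linear objective in $Q$ under a KL constraint) has a maximizer of the exponentially tilted form $Q(h) \propto P(h)\exp(\lambda \mathcal{R}_n(h))$ with $\lambda \ge 0$. The sketch below finds $\lambda$ by bisection, reading (7) as a maximization of explanatory power. It is an illustrative solver under that reading, not the algorithm used later in the paper; the scores and $\omega$ are made up.

```python
import math

def kl_divergence(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def tilted(prior, scores, lam):
    """Q(h) proportional to P(h) * exp(lam * score(h)), computed stably."""
    top = max(scores)
    weights = [p * math.exp(lam * (s - top)) for p, s in zip(prior, scores)]
    total = sum(weights)
    return [w / total for w in weights]

def solve_kl_constrained(prior, scores, omega, iters=100):
    """Maximize sum_h Q(h) * score(h) subject to KL(Q, P) <= omega."""
    lo, hi = 0.0, 1.0
    # Grow hi until the KL constraint becomes active (or give up).
    while kl_divergence(tilted(prior, scores, hi), prior) < omega and hi < 1e6:
        hi *= 2.0
    if kl_divergence(tilted(prior, scores, hi), prior) <= omega:
        return tilted(prior, scores, hi)   # constraint never binds
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl_divergence(tilted(prior, scores, mid), prior) <= omega:
            lo = mid
        else:
            hi = mid
    return tilted(prior, scores, lo)

# Made-up empirical scores R_n(h) for four clusters and a uniform prior.
scores = [0.375, 0.125, 0.25, 0.25]
prior = [0.25] * 4
omega = 0.05
q_n = solve_kl_constrained(prior, scores, omega)
objective = sum(q * s for q, s in zip(q_n, scores))
print(q_n, objective)
```

A small $\omega$ keeps $Q_n$ close to the prior, while a large $\omega$ lets it concentrate on the best-scoring cluster.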



2.2 An Application of PAC-Bayes Towards Pairwise Clustering 

Now we explain how the above insights lead to an analysis of the pairwise clustering setup. Let $\mathcal{Z}$ and $\mathcal{Y}$ denote the two domains of interest in which pairwise observations $(z, y)$ are made. A first approach would be to rephrase the pairwise clustering problem as a standard clustering problem where, instead of the class of indicator functions $\mathcal{H}_f \subset \{f : \mathcal{Z} \to [0, 1]\}$ in the first domain, one studies the cross-product of this class with the class of indicator functions $\mathcal{H}_g$ in the other domain, $\mathcal{H}^{f,g} = \mathcal{H}_f \times \mathcal{H}_g$, or

$$\mathcal{H}^{f,g} \subset \{h = (f_h, g_h) \mid f_h : \mathcal{Z} \to [0, 1],\; g_h : \mathcal{Y} \to [0, 1]\}. \qquad (8)$$

However, the reasoning in the introduction suggests another route. To see this, we formalize the intuition of the pairwise observation being a target for prediction: (i) let $z \in \mathcal{Z}$ represent the part of a sample $x = (z, y)$ which might be used to predict (a property of) the (unobserved) $y \in \mathcal{Y}$; and/or (ii) given $y \in \mathcal{Y}$, predict (a property of) the corresponding (unobserved) $z \in \mathcal{Z}$. Given a set $\mathcal{H}^{f,g}$, the knowledge of the 'cluster' to which $z$ belongs will be used to predict the cluster membership of the corresponding $y$.

We will say that $f_h$ explains $z \in \mathcal{Z}$ if $f_h(z) = 1$, and similarly that $g_h$ explains $y \in \mathcal{Y}$ if $g_h(y) = 1$. In an ideal case, one would be able to associate exactly one distinct $f_h \in \mathcal{H}_f$ to every $g_h \in \mathcal{H}_g$ (i.e. describe a permutation). As such, one could predict the cluster $g_h$ containing $y$ corresponding to a given $z$. In the worst case, the choice of $g$ that explains $y$ is independent of $z$ being explained by $f$. The pairwise clustering setup however differs from such a multi-class classification (structured output prediction) task as it is essentially symmetric: a given $z$ is used to predict (the cluster membership of) the corresponding $y$, and a given $y$ is used to predict (the cluster membership of) the corresponding $z$. Now, a pairwise cluster $h = (f, g) \in \mathcal{H}^{f,g}$ is useful for a sample $(z, y) \in \mathcal{Z} \times \mathcal{Y}$ if $f(z) = g(y)$. Conversely, a pairwise cluster $h = (f, g)$ contradicts a sample if $f(z) \ne g(y)$. This motivates the following risk function

$$\mathcal{R}_n(h) = \frac{1}{n}\sum_{i=1}^n I\big(f_h(Z_i) \ne g_h(Y_i)\big), \qquad \mathcal{R}(h) = \mathbb{P}\big(f_h(Z) \ne g_h(Y)\big), \qquad (9)$$

defined again in an 'empirical' and an 'actual' flavor. This definition measures how many datapoints (or how large a probability mass) are contradicted by a pairwise cluster $h = (f_h, g_h)$. Now the term $E_Q[\mathcal{R}(h)]$ becomes

$$E_Q[\mathcal{R}(h)] = \sum_{h \in \mathcal{H}^{f,g}} \mathbb{P}\big(f_h(Z) \ne g_h(Y)\big)\,Q(h), \qquad (10)$$

which basically captures how many mistakes are made when focussing on the subset of $\mathcal{H}^{f,g}$ as directed by $Q$. This motivates the following practical approach: given (i) a dataset $\{X_i = (Z_i, Y_i)\}_{i=1}^n$ with the elements taking values in $\mathcal{Z} \times \mathcal{Y}$, (ii) a set $\mathcal{H}^{f,g}$ of pairwise clusters represented as $h = (f, g)$, and a 'prior' weighting function $P : \mathcal{H}^{f,g} \to [0, 1]$, we aim to find a new weighting function $Q_n : \mathcal{H}^{f,g} \to [0, 1]$ which is not too different from $P$, and which aligns well with the probability rules underlying the data:

$$Q^* = \arg\min_{Q} E_Q[\mathcal{R}(h)] \quad \text{s.t.} \quad \mathrm{KL}(Q, P) \le \omega, \qquad (11)$$

where $\omega > 0$. The PAC-Bayes theorem now guarantees that this problem is approximately solved based on the data as

$$Q'_n = \arg\min_{Q} E_Q[\mathcal{R}_n(h)] \quad \text{s.t.} \quad \mathrm{KL}(Q, P) \le \omega, \qquad (12)$$

where $\omega > 0$. The resulting $Q'_n$ will emphasize the pairwise clusters which are most often consistent with the data. Here we have a natural trade-off between specificity and accuracy, regulated by $\omega$. If $\omega$ is small, the solution $Q'_n$ cannot deviate from the uniform distribution over all pairwise clusters in $\mathcal{H}^{f,g}$, but then many different pairwise clusters will be contradicted by different samples, leading in turn to low explanatory power. On the other hand, allowing for arbitrary $Q'_n$ will explain the individual samples fairly well (allowing a single pairwise cluster per sample), but the PAC-Bayesian result will no longer guarantee accuracy of the result.
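The empirical pairwise risk of (9) and its $Q$-weighted version can be evaluated directly for a small hypothesis set. The following sketch uses hypothetical interval indicators as $f$ and $g$ and four made-up pairs $(z, y)$; it only illustrates the definition of 'contradiction'.

```python
def interval_indicator(lo, hi):
    """Hypothetical cluster membership function: indicator of [lo, hi)."""
    return lambda v: 1 if lo <= v < hi else 0

def pairwise_empirical_risk(f, g, pairs):
    """Fraction of pairs (z, y) contradicted by h = (f, g): f(z) != g(y)."""
    return sum(1 for z, y in pairs if f(z) != g(y)) / len(pairs)

# Made-up paired observations (z_i, y_i).
pairs = [(0.1, 0.15), (0.3, 0.9), (0.7, 0.75), (0.05, 0.5)]

# Three illustrative pairwise clusters h = (f, g).
clusters = [
    (interval_indicator(0.0, 0.2), interval_indicator(0.0, 0.2)),
    (interval_indicator(0.6, 0.8), interval_indicator(0.6, 0.8)),
    (interval_indicator(0.0, 0.2), interval_indicator(0.8, 1.0)),
]
pair_risks = [pairwise_empirical_risk(f, g, pairs) for f, g in clusters]

# Q-weighted empirical risk for a weighting that ignores the third cluster.
q = [0.5, 0.5, 0.0]
q_risk = sum(w * r for w, r in zip(q, pair_risks))
print(pair_risks, q_risk)
```

Note that a pair with $f(z) = g(y) = 0$ counts as consistent: a cluster is only contradicted when exactly one of the two views falls inside it.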

We now express the 'regularization term' $\mathrm{KL}(Q, P)$ in a more convenient form.

Proposition 1 (Bound to K.-L. Divergence) Assume $|\mathcal{H}| < \infty$ and $P(h) = \frac{1}{|\mathcal{H}|}$ for all $h \in \mathcal{H}$, then

$$\mathrm{KL}(Q, P) \le \log \sum_{h \in \mathcal{H}} Q^2(h) + \log(|\mathcal{H}|). \qquad (13)$$

This is a consequence of the following inequality on the entropy of a vector $p \in\, ]0, 1[^d$ with $1^T p = 1$:

$$-H(p) = \sum_{i=1}^d p_i \log(p_i) \le \log\Big(\sum_{i=1}^d p_i^2\Big), \qquad (14)$$

which follows by application of Jensen's inequality. Let $s_Q \in [0, 1]^{|\mathcal{H}|}$ be a vector representing the function $Q$, where $s_{Q,i} = Q(h_i)$ (enumerating the different elements $h_i \in \mathcal{H}$), then

$$\hat{s}_Q = \arg\min_{s_Q} \|s_Q\|_2 \quad \text{s.t.} \quad E_Q[\mathcal{R}_n(h)] = 0, \qquad (15)$$

implementing the so-called realizable case (as in the theory of Support Vector Machines). The optimal solution $Q_n$ will try to find as many pairwise clusters as possible which do not contradict the given data. We illustrate this notion in Figure 2. In the ideal case, all observations are explained. In more realistic cases, merely a few pairwise clusters are found (i.e., the set $\{h \in \mathcal{H} : Q(h) > 0\}$ contains only a few elements).
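The bound of Proposition 1 is easy to check numerically for a uniform prior. The sketch below draws random weight vectors (a made-up test, not an experiment from the paper) and compares both sides of (13). It also shows that the bound holds with equality when $Q$ is uniform on its support.

```python
import math
import random

def kl_to_uniform(q):
    """KL(Q, P) for the uniform prior P(h) = 1/|H|."""
    n = len(q)
    return sum(qi * math.log(qi * n) for qi in q if qi > 0)

def prop1_bound(q):
    """Right-hand side of (13): log(sum_h Q(h)^2) + log|H|."""
    return math.log(sum(qi * qi for qi in q)) + math.log(len(q))

random.seed(0)
gaps = []
for _ in range(1000):
    raw = [random.random() for _ in range(25)]
    total = sum(raw)
    q = [r / total for r in raw]
    gaps.append(prop1_bound(q) - kl_to_uniform(q))

# Q uniform over 5 of 25 hypotheses: both sides equal log(25 / 5).
q_flat = [0.2] * 5 + [0.0] * 20
print(min(gaps), kl_to_uniform(q_flat), prop1_bound(q_flat))
```

The equality case is relevant for (15): minimizing the norm of $s_Q$ under the constraints favours weightings that are spread evenly over the admissible hypotheses.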




Figure 2: Schematic representation of all pairwise clusters in a hypothesis space $\mathcal{H}$ based on the 5 disjoint intervals $d + [0, 0.2]$ in either domain (dotted lines). The dots $(X, Y) \in \mathbb{R} \times \mathbb{R}$ represent samples from an underlying distribution. Suppose the different hypotheses can be factorized as $h_c = (f, g)$, where $f : \mathbb{R} \to [0, 1]$ and $g : \mathbb{R} \to [0, 1]$ are the corresponding indicator functions in either domain. This means that there are 25 possible different pairwise clusters $h_c$ (dotted squares), or $|\mathcal{H}| = 25$. (a) About 70% of the observations (dots) do not contradict the 5 pairwise clusters (yellow squares) simultaneously; (b) only one sample ('□') contradicts the shown pairwise cluster $h_c$ (yellow squares), while the other two ('o' and '×') are consistent with $h_c$.
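For a finite grid such as the one in Figure 2, the realizable problem (15) can be solved by enumeration: the constraint $E_Q[\mathcal{R}_n(h)] = 0$ forces $Q$ to put weight only on pairwise clusters that no sample contradicts, and among such weightings the one with smallest 2-norm (given that the weights sum to one) is uniform. The sketch below assumes 5 equal-width intervals on $[0, 1]$ in each domain and uses made-up samples.

```python
from itertools import product

def cell(v, k=5):
    """Index of the interval of width 1/k containing v (v in [0, 1])."""
    return min(int(v * k), k - 1)

def realizable_weights(pairs, k=5):
    """Uniform weights over all pairwise clusters (a, b) with zero empirical risk."""
    support = []
    for a, b in product(range(k), repeat=2):
        # (a, b) is contradicted when exactly one of z in a, y in b holds.
        contradicted = any((cell(z, k) == a) != (cell(y, k) == b) for z, y in pairs)
        if not contradicted:
            support.append((a, b))
    if not support:
        raise ValueError("no pairwise cluster is consistent with all samples")
    weight = 1.0 / len(support)
    return {h: weight for h in support}

# Made-up samples falling in grid cells (0, 0), (2, 3) and (4, 1).
pairs = [(0.1, 0.1), (0.15, 0.05), (0.5, 0.7), (0.45, 0.65), (0.9, 0.3)]
weights = realizable_weights(pairs)
print(sorted(weights))
```

In this example the support contains the three occupied cells plus four cells whose rows and columns contain no samples at all, which matches the remark that many pairwise clusters are never contradicted simply because no data falls near them.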

We extend this model to account for an infinite $\mathcal{H}$, defined as $h = (\delta_z, \delta_y)$ for each $(z, y) \in \mathcal{Z} \times \mathcal{Y}$, where $\delta_x$ denotes the Dirac delta. When extending the formulation in order to deal with infinite hypothesis spaces $\mathcal{H}^{f,g}$, we replace the vectors $s_Q$ by functions $Q : \mathcal{H} \to \mathbb{R}^+$, which (for convenience) are assumed to be elements of a Hilbert space $\mathbb{H}$. This space is equipped with a corresponding inner product (reproducing kernel) $k : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$, implicitly defining $\mathbb{H}$ and $P$. Note that $Q(h) \ge 0$ for all $h \in \mathcal{H}$, and $\int_{\mathcal{H}} Q(h)\,dh = 1$. This motivates the replacement of the term $\mathrm{KL}(Q, P)$ by $\|Q\|_{\mathbb{H}}$. As such, (12) is equivalent (up to normalization) to

$$Q'_n = \arg\min_{Q} \|Q\|_{\mathbb{H}} \quad \text{s.t.} \quad E_Q[\mathcal{R}_n(h)] = 0, \qquad (16)$$

where $Q'_n(h) \ge 0$ for all $h \in \mathcal{H}$, and $\int_{\mathcal{H}} Q'_n(h)\,dh = 1$. Note that for the majority of pairwise clusters no data is sampled contradicting the cluster, and a smooth transition of $Q$ in between the samples becomes possible. In the remainder we will assume that the relevant Hilbert space $\mathbb{H}$ can be decomposed additively and uniquely as $\mathbb{H}_z \oplus \mathbb{H}_y$, so that the norm of a function $Q$ can be written as $\|Q\|_{\mathbb{H}} = \|F\|_{\mathbb{H}_z} + \|G\|_{\mathbb{H}_y}$. Assume $\mathcal{H}^{f,g}$ contains all pairwise clusters $h = (\delta_z, \delta_y)$ for all $(z, y) \in \mathcal{Z} \times \mathcal{Y}$, with $\delta$ the Dirac delta. Under the assumption that no ties occur in the data, problem (16) becomes

$$(F_n, G_n) = \arg\min_{F, G} \|F\|^2_{\mathbb{H}_z} + \|G\|^2_{\mathbb{H}_y} \quad \text{s.t.} \quad F_i = G_i, \;\; \forall i = 1, \ldots, n, \qquad (17)$$

enforcing that $F(h) = G(h)$ for all $h \in \mathcal{H}^{f,g}$, and enforcing again that $F(h) \ge 0$ for all $h \in \mathcal{H}^{f,g}$ as well as that $\int_{\mathcal{H}^{f,g}} F(h)\,dh = 1$. Here $F_i = F(\delta_{Z_i})$ and $G_i = G(\delta_{Y_i})$, both representing $Q((\delta_{Z_i}, \delta_{Y_i}))$, for all $i = 1, \ldots, n$. The next section shows how to solve this problem, relaxing the (in)equality constraints.



3 Kernel PairWise Component Analysis 
3.1 PWCA for paired Observations 

This section studies how the learning problem (17) is solved (approximately) by an efficient algorithm. Let $X_a = (X^a_1, \ldots, X^a_\ell)' \in \mathbb{R}^{\ell \times m}$ and $Y_b = (Y^b_1, \ldots, Y^b_\ell)' \in \mathbb{R}^{\ell \times n}$ be matrices, where $\ell$ is the number of samples and $m$, $n$ are the numbers of attributes/features for the first and second representation respectively. The functions $Q$ are parametrised as $F_{w_c}(z) = w_c' z$ and $G_{v_c}(y) = v_c' y$. The inequalities $Q(h) \ge 0$ are enforced by representing $Q$ as $Q(h) = F^2(h) = G^2(h)$ for all $h \in \mathcal{H}^{f,g}$. This is imposed by enforcing $c_i = \sqrt{Q((\delta_{Z_i}, \delta_{Y_i}))} = F(\delta_{Z_i}) = G(\delta_{Y_i})$; then $\int_{\mathcal{H}} Q(h)\,dh = 1$ is enforced by imposing the constraint $c'c = 1$ (similarly, maximizing $c'c$). As such, (12) becomes

$$\max_{c \in \mathbb{R}^\ell,\, w_c \in \mathbb{R}^m,\, v_c \in \mathbb{R}^n} \; \frac{1}{2} c'c - \frac{\gamma}{2}\,(w_c' w_c + v_c' v_c), \qquad (18)$$

where $A'$ is the transpose of a matrix or vector $A$, and such that $c_i = X_{a,i} w_c = Y_{b,i} v_c$ for $i = 1, \ldots, \ell$. Associating Lagrange multipliers $\alpha_i$, $\beta_i$ with each of the $\ell$ constraints gives the following Lagrangian

$$\mathcal{L} = \frac{1}{2} c'c - \frac{\gamma}{2}\,(w_c' w_c + v_c' v_c) - \alpha'(c - X_a w_c) - \beta'(c - Y_b v_c). \qquad (19)$$



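Taking derivatives of equation (19) with respect to $w_c$, $v_c$, $c$ and setting them to zero gives the following conditions for optimality:

$$\frac{\partial \mathcal{L}}{\partial w_c} = 0 \;\to\; w_c = \frac{1}{\gamma} X_a' \alpha, \qquad \frac{\partial \mathcal{L}}{\partial v_c} = 0 \;\to\; v_c = \frac{1}{\gamma} Y_b' \beta, \qquad \frac{\partial \mathcal{L}}{\partial c} = 0 \;\to\; c = \alpha + \beta.$$

Substituting back into the optimisation in equation (18) gives the following dual problem

$$\max_{\alpha, \beta} \; J = \frac{1}{2}(\alpha + \beta)'(\alpha + \beta) - \frac{1}{2\gamma}\,(\alpha' K_a \alpha + \beta' K_b \beta),$$

where $K_a = X_a X_a'$ and $K_b = Y_b Y_b'$ are the kernel matrices. Taking derivatives and setting to zero shows that $J$ achieves a (local) optimum when

$$\frac{\partial J}{\partial \alpha} = 0 \;\to\; \gamma(\alpha + \beta) = K_a \alpha, \qquad \frac{\partial J}{\partial \beta} = 0 \;\to\; \gamma(\alpha + \beta) = K_b \beta. \qquad (20)$$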



We observe that at the optimum $K_a \alpha = K_b \beta$, which illustrates a direct relationship to the KCCA condition. Due to limited space we do not explore the relationship to KCCA further within the scope of this manuscript. Equation (20) can be rewritten as

$$\begin{bmatrix} K_a & 0_\ell \\ 0_\ell & K_b \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \gamma \begin{bmatrix} I_\ell & I_\ell \\ I_\ell & I_\ell \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \qquad (21)$$

where $I_\ell$ is the identity matrix and $0_\ell$ is a matrix of zeros, both of size $\ell \times \ell$. This equation may be solved as a generalized eigenvalue problem of the form $Ax = \lambda Bx$. Alternatively, we observe that by setting $\beta = \left(\frac{1}{\gamma} K_a - I_\ell\right)\alpha$, we can express $\frac{1}{\gamma} K_a \alpha = \frac{1}{\gamma^2} K_b K_a \alpha - \frac{1}{\gamma} K_b \alpha$, which results in the following generalized eigenvalue problem for $\alpha$:

$$K_b K_a \alpha = \gamma\,(K_a + K_b)\,\alpha, \qquad (22)$$

and by setting $R$ to be the Cholesky decomposition of $K_b K_a$ such that $K_b K_a = RR'$ we obtain the following symmetric eigenvalue problem

$$I_\ell\,\alpha = \gamma\,R^{-1}(K_a + K_b)\,R^{-1}\alpha.$$
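One way to obtain numerical solutions of (20), and hence of (21) and (22), is sketched below with NumPy. It relies on a substitution that is not spelled out in the text: writing $u = \alpha + \beta$, the conditions (20) give $\alpha = \gamma K_a^{-1} u$ and $\beta = \gamma K_b^{-1} u$, so that $u$ is an eigenvector of the symmetric matrix $K_a (K_a + K_b)^{-1} K_b$ with eigenvalue $\gamma$. The sketch assumes both kernel matrices are invertible (for instance after adding a small ridge); the function name and the random data are illustrative only.

```python
import numpy as np

def solve_pwca_pair(K_a, K_b):
    """Return (gammas, alphas, betas) satisfying gamma*(alpha+beta) = K_a alpha = K_b beta.

    Column j of alphas/betas belongs to gammas[j]; eigenvalues are sorted
    in decreasing order. Assumes K_a and K_b are symmetric and invertible.
    """
    P = K_a @ np.linalg.solve(K_a + K_b, K_b)   # equals inv(inv(K_a) + inv(K_b))
    P = 0.5 * (P + P.T)                          # remove round-off asymmetry
    gammas, U = np.linalg.eigh(P)
    order = np.argsort(gammas)[::-1]
    gammas, U = gammas[order], U[:, order]
    alphas = np.linalg.solve(K_a, U) * gammas    # scales column j by gammas[j]
    betas = np.linalg.solve(K_b, U) * gammas
    return gammas, alphas, betas

# Made-up paired data: 6 samples, 10 and 8 features, linear kernels.
rng = np.random.default_rng(0)
X_a = rng.standard_normal((6, 10))
Y_b = rng.standard_normal((6, 8))
K_a, K_b = X_a @ X_a.T, Y_b @ Y_b.T

gammas, alphas, betas = solve_pwca_pair(K_a, K_b)
g, a, b = gammas[0], alphas[:, 0], betas[:, 0]
print(g, np.allclose(K_a @ a, g * (a + b)), np.allclose(K_b @ b, g * (a + b)))
```

Working with the symmetric matrix keeps the eigenvalues real and allows `numpy.linalg.eigh` to be used; a generalized eigenvalue solver applied to (21) or (22) directly is an alternative.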

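It may be necessary to regularize equation (21) with some small value $\tau$ on the diagonal. This will result in our optimisation being rewritten as

$$\begin{bmatrix} K_a & 0_\ell \\ 0_\ell & K_b \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \gamma \begin{bmatrix} \tau I_\ell & I_\ell \\ I_\ell & \tau I_\ell \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}.$$

Furthermore, the above eigenvalue problem can be written as $\beta = \left[\frac{1}{\gamma} K_a - \tau I_\ell\right]\alpha$ and

$$K_b K_a \alpha = \gamma^2 (I_\ell - \tau^2 I_\ell)\,\alpha + \gamma\,(\tau I_\ell K_a + \tau I_\ell K_b)\,\alpha,$$

which can be solved as a quadratic eigenvalue problem. It follows from the conditions for optimality that a new sample $(x_a, y_b)$ can be projected into the learnt semantic space by the functions

$$F(x_a) = w_c' x_a = \frac{1}{\gamma}\,\alpha' K_a(X_a, x_a), \qquad G(y_b) = v_c' y_b = \frac{1}{\gamma}\,\beta' K_b(Y_b, y_b).$$

Then it is also reasonable to assign the sample $(x_a, y_b)$ to the cluster (among $1, \ldots, \ell$) which has the highest (absolute) factors $|F(x_a)|_j$ and $|G(y_b)|_j$ respectively.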
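The projection of a new sample only needs the dual vector and kernel evaluations against the training data. The sketch below checks, for a linear kernel and made-up data, that the dual form $\frac{1}{\gamma}\alpha' K_a(X_a, x_a)$ agrees with the primal form $w_c' x_a$ where $w_c = \frac{1}{\gamma} X_a'\alpha$. The dual vector here is random and serves only to illustrate the identity.

```python
import numpy as np

def project(dual, gamma, train, new):
    """F(x) = (1/gamma) * dual' K(train, x), with the linear kernel K(train, x) = train @ x."""
    K_new = train @ new.T            # shape: (n_train, n_new)
    return K_new.T @ dual / gamma    # one score per new sample

# Made-up training view, dual vector and new samples.
rng = np.random.default_rng(1)
X_a = rng.standard_normal((5, 3))
alpha = rng.standard_normal(5)
gamma = 0.7
x_new = rng.standard_normal((2, 3))

w_c = X_a.T @ alpha / gamma          # primal weights from the optimality condition
scores = project(alpha, gamma, X_a, x_new)
print(scores, x_new @ w_c)
```

With a non-linear kernel only the line computing `K_new` would change, since the primal weights are then no longer available explicitly.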



3.2 PWCA for Multiview Observations 

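In this section we generalize our methodology to multiple views. Expressing the optimization in equation (18) for three sources gives

$$\max_{c \in \mathbb{R}^\ell,\, w_c \in \mathbb{R}^m,\, v_c \in \mathbb{R}^n,\, z_c \in \mathbb{R}^p} \; \frac{1}{2} c'c - \frac{\gamma}{2}\,(w_c' w_c + v_c' v_c + z_c' z_c), \qquad (23)$$

such that $c_i = X_{a,i} w_c = X_{b,i} v_c = X_{c,i} z_c$ for $i = 1, \ldots, \ell$. Taking derivatives of equation (23) with respect to $w_c$, $v_c$, $z_c$, $c$ and setting them to zero gives the conditions for optimality. Substituting these conditions back into equation (23) gives the following dual problem

$$\max_{\alpha, \beta, \nu \in \mathbb{R}^\ell} \; J = \frac{1}{2}(\alpha + \beta + \nu)'(\alpha + \beta + \nu) - \frac{1}{2\gamma}\,(\alpha' K_a \alpha + \beta' K_b \beta + \nu' K_c \nu),$$

where $K_a = X_a X_a'$, $K_b = X_b X_b'$ and $K_c = X_c X_c'$ are the kernel matrices. Taking derivatives and setting to zero shows that $J$ achieves a (local) optimum when

$$\frac{\partial J}{\partial \alpha} = 0 \;\to\; \gamma(\alpha + \beta + \nu) = K_a \alpha, \qquad \frac{\partial J}{\partial \beta} = 0 \;\to\; \gamma(\alpha + \beta + \nu) = K_b \beta, \qquad \frac{\partial J}{\partial \nu} = 0 \;\to\; \gamma(\alpha + \beta + \nu) = K_c \nu,$$

which can be rewritten as

$$\begin{bmatrix} K_a & 0_\ell & 0_\ell \\ 0_\ell & K_b & 0_\ell \\ 0_\ell & 0_\ell & K_c \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \\ \nu \end{bmatrix} = \gamma \begin{bmatrix} I_\ell & I_\ell & I_\ell \\ I_\ell & I_\ell & I_\ell \\ I_\ell & I_\ell & I_\ell \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \\ \nu \end{bmatrix},$$

where again $I_\ell$ is the identity matrix and $0_\ell$ is a matrix of zeros, both of size $\ell \times \ell$. Therefore, without loss of generality, we can extend this to multiple views $i = 1, \ldots, s$, where $s \ge 2$, similarly to the previously proposed multi-view extension for CCA by [8], such that

$$\begin{bmatrix} K_1 & & \\ & \ddots & \\ & & K_s \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_s \end{bmatrix} = \gamma \begin{bmatrix} I_\ell & \cdots & I_\ell \\ \vdots & \ddots & \vdots \\ I_\ell & \cdots & I_\ell \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_s \end{bmatrix}.$$

This equation may be solved as a generalized eigenvalue problem of the form $Ax = \lambda Bx$.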
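The multi-view system has the same structure as the two-view case, so the same numerical route applies. The sketch below is an illustration, not the authors' implementation. It assumes invertible kernel matrices and uses the substitution $u = \sum_i \alpha_i$, under which $u$ is an eigenvector of $(\sum_i K_i^{-1})^{-1}$ with eigenvalue $\gamma$ and $\alpha_i = \gamma K_i^{-1} u$. It then verifies the block equation $Ax = \gamma Bx$ on made-up data for three views.

```python
import numpy as np

def solve_multiview(kernels):
    """Leading (gamma, [alpha_1, ..., alpha_s]) with K_i alpha_i = gamma * sum_j alpha_j."""
    inv_sum = sum(np.linalg.inv(K) for K in kernels)
    P = np.linalg.inv(inv_sum)
    P = 0.5 * (P + P.T)                       # remove round-off asymmetry
    gammas, U = np.linalg.eigh(P)             # ascending eigenvalues
    gamma, u = gammas[-1], U[:, -1]
    duals = [gamma * np.linalg.solve(K, u) for K in kernels]
    return gamma, duals

def block_system(kernels):
    """Build A = blockdiag(K_1..K_s) and B = (s x s blocks of identity)."""
    s, n = len(kernels), kernels[0].shape[0]
    A = np.zeros((s * n, s * n))
    for i, K in enumerate(kernels):
        A[i * n:(i + 1) * n, i * n:(i + 1) * n] = K
    B = np.kron(np.ones((s, s)), np.eye(n))
    return A, B

# Made-up data: 5 aligned samples in three views with 7, 8 and 9 features.
rng = np.random.default_rng(2)
views = [rng.standard_normal((5, d)) for d in (7, 8, 9)]
kernels = [X @ X.T for X in views]

gamma, duals = solve_multiview(kernels)
A, B = block_system(kernels)
x = np.concatenate(duals)
print(gamma, np.allclose(A @ x, gamma * (B @ x)))
```

Since $B$ is singular, solving the block problem directly with a generalized eigenvalue routine yields infinite eigenvalues alongside the finite ones; the reduced symmetric problem avoids this.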
4 Experiments of PWCA on Europarl



We proceed to compare PWCA to KCCA on a mate-retrieval task [13, 14, 15, 16], i.e. given a document query in language x, retrieve the (exact) matching document in the paired language y. For this purpose we use the multi-lingual Europarl dataset [9], which has a total of 11968 aligned documents. We use the following eight languages, with the number of features/words in brackets: da - Danish (78720), de - German (153499), en - English (60369), es - Spanish (171821), it - Italian (66548), nl - Dutch (105318), pt - Portuguese (66922) and sv - Swedish (51116). We use linear kernels throughout and arbitrarily set the regularization parameter to $\tau = 0.01$ for both methods. Finally, the performance is evaluated using Average Precision (AP) [17], which is computed as $AP = \frac{1}{M}\sum_{i=1}^{M} \frac{1}{I_i}$, where $M$ is the number of query documents and $I_i$ is the rank location of the exact paired document for query document $q_i$. Therefore $AP = 0.5$ indicates that the paired document is on average situated at location $I = 2$. We select the rank by sorting the absolute values of the inner products $F(q_i)'G(y_j)$ (as well as $F(x_i)'G(q_j)$) for all possible paired test documents, i.e. we rank the retrieved documents according to their similarity (in the learnt space) with our query. In our experiments we use the CCA formulation proposed by [8] for both the pair-view and multi-view settings.
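The evaluation measure can be sketched in a few lines. The function below assumes a square score matrix whose entry $(i, j)$ is the similarity between query $i$ and candidate document $j$, with the true mate of query $i$ at index $i$; the tie-breaking rule (rank is one plus the number of strictly better candidates) and the example scores are assumptions of this sketch.

```python
import numpy as np

def mate_retrieval_ap(scores):
    """AP = mean over queries of 1 / rank of the true mate (diagonal entry)."""
    scores = np.abs(np.asarray(scores, dtype=float))
    true_scores = np.diag(scores)
    # Rank of the mate: 1 + number of candidates scoring strictly higher.
    ranks = 1 + np.sum(scores > true_scores[:, None], axis=1)
    return float(np.mean(1.0 / ranks))

# Made-up similarities for three queries: the mates end up at ranks 1, 2 and 3.
example_scores = [
    [0.9, 0.1, 0.2],
    [0.8, 0.5, 0.1],
    [0.3, 0.2, 0.1],
]
ap = mate_retrieval_ap(example_scores)
print(ap)
```

For this example the value is $(1 + 1/2 + 1/3)/3 = 11/18$.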

In the first of our two experiments, for each pairing of languages, we randomly select 500 paired documents for training and 5000 for testing. The analysis has been repeated 10 times and the results averaged. The results given in Table 1 are the AP averaged across all possible language-pair combinations for the language indicated in the column (i.e. column da is the average over all language pairings da - xx). We observe that PWCA performs, on average, on a par with KCCA. The mean AP across all languages is 0.4435 for KCCA, whereas for PWCA it is 0.4459.



Table 1: We compare KCCA and PWCA on a bilingual mate-retrieval task (see text for language abbreviations). The reported results are the AP for retrieving the exact paired document in another language, averaged across all possible language-pair combinations for the language indicated in the column. The results are averaged over 10 repeats of the analysis.

        da      de      en      es      it      nl      pt      sv
KCCA  0.4174  0.3839  0.4979  0.4243  0.4572  0.4023  0.4939  0.4714
PWCA  0.4294  0.4416  0.4747  0.4344  0.4368  0.4111  0.4679  0.4716



In the second experiment we extend the previous analysis to a trilingual mate-retrieval task, i.e. we train on an aligned document corpus from three languages, whereas during testing we compute the mean average precision of all the individual pairwise mate-retrieval tasks (of the three languages). In other words, we train on the trilingual alignment da-de-en while we test query retrieval on the bilingual tasks da-de, da-en and de-en. In this experiment we randomly select 500 tripartite documents for training and 2000 for testing. Due to the increased complexity we run the analysis only once for each three-language combination. The results given in Table 2 are, as in the previous table, the mean average precision for the language stated in the column over all its possible tripartite combinations (without repetition, i.e. da-da-en, for example, is not allowed). PWCA clearly improves over KCCA despite the increased complexity of the training alignment. Furthermore, for PWCA the added aligned language not only did not hinder the mate-retrieval task, it improved performance, as is visible when comparing Table 1 with Table 2.



Table 2: We compare KCCA and PWCA on a trilingual mate-retrieval task (see text for language abbreviations). The reported results are the mean average precision for retrieving the exact paired document in another language for all possible tripartite combinations of the language stated in the column (without repetition) for training.

        da      de      en      es      it      nl      pt      sv
KCCA  0.3687  0.3290  0.3930  0.3742  0.3792  0.3501  0.3917  0.3909
PWCA  0.5407  0.5155  0.5427  0.5394  0.5310  0.5246  0.5406  0.5504
CCA (and KCCA) does not seek to maintain any pre-existing structure within the views while seeking to maximise correlation across the views. This may lead to over-fitting when having multiple views. PWCA addresses this by directly seeking to maintain internal structure, trying to find as many pairwise (or n-wise) clusters as possible which do not contradict the given data. We hypothesise that the PWCA performance improvement is a direct result of the clustering condition.

5 Discussion 

This study presented a novel learning paradigm and corresponding algorithm that aims at finding structure (pairwise clusters) in paired (multi-view) observations. A case study on bilingual and trilingual mate-retrieval tasks, and a motivation using PAC-Bayesian results, are given. While this paper described a theoretical as well as an applied proof of concept, many issues including efficiency, out-of-sample extensions and relations to other techniques remain open.